Distributed and Provably Good Seedings for k-Means in Constant Rounds
Abstract
The k-means++ algorithm is the state-of-the-art algorithm for solving k-Means clustering problems, as the computed clusterings are O(log k)-competitive in expectation. However, its seeding step requires k inherently sequential passes through the full data set, making it hard to scale to massive data sets. The standard remedy is to use the k-means‖ algorithm, which reduces the number of sequential rounds and is thus suitable for a distributed setting. In this paper, we provide a novel analysis of the k-means‖ algorithm that bounds the expected solution quality for any number of rounds and oversampling factors greater than k, the two parameters one needs to choose in practice. In particular, we show that k-means‖ provides provably good clusterings even for a small, constant number of iterations. This theoretical finding explains the common observation that k-means‖ performs extremely well in practice even if the number of rounds is low. We further provide a hard instance that shows that an additive error term as encountered in our analysis is inevitable if fewer than k−1 rounds are employed.
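The two parameters named in the abstract, the number of rounds and the oversampling factor, can be made concrete with a small sketch of k-means‖-style seeding. This is an illustrative single-machine version in plain Python, not the paper's implementation: the function names (`kmeans_parallel_seed`, `weighted_kmeanspp`), the default oversampling factor of 2k, and the use of weighted k-means++ to reduce the candidate set to k centers are all assumptions made for the example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def min_sq_dists(points, centers):
    """Squared distance from every point to its closest center."""
    return [min(sq_dist(p, c) for c in centers) for p in points]


def weighted_kmeanspp(candidates, weights, k, rng):
    """Weighted D^2 sampling over the candidates (assumed reclustering step)."""
    centers = rng.choices(candidates, weights=weights, k=1)
    while len(centers) < k:
        d2 = min_sq_dists(candidates, centers)
        scores = [w * d for w, d in zip(weights, d2)]
        if sum(scores) == 0:
            # Every weighted candidate is already covered; pad arbitrarily.
            centers.append(rng.choice(candidates))
            continue
        centers.extend(rng.choices(candidates, weights=scores, k=1))
    return centers


def kmeans_parallel_seed(points, k, rounds=5, oversampling=None, rng=None):
    """Sketch of k-means||-style seeding (hypothetical helper, not a library API)."""
    rng = rng or random.Random(0)
    ell = oversampling if oversampling is not None else 2 * k  # assumed default
    candidates = [rng.choice(points)]  # first center: uniform at random
    for _ in range(rounds):
        d2 = min_sq_dists(points, candidates)
        cost = sum(d2)
        if cost == 0:
            break  # every point coincides with some candidate
        # Each point is sampled independently, so in a distributed setting
        # every worker could run this step on its own shard of the data.
        sampled = [p for p, d in zip(points, d2)
                   if rng.random() < min(1.0, ell * d / cost)]
        candidates.extend(sampled)
    # Weight each candidate by the number of points closest to it.
    weights = [0] * len(candidates)
    for p in points:
        j = min(range(len(candidates)), key=lambda i: sq_dist(p, candidates[i]))
        weights[j] += 1
    return weighted_kmeanspp(candidates, weights, k, rng)
```

A call such as `kmeans_parallel_seed(points, k=3, rounds=3, oversampling=6)` returns three data points to use as initial centers. Each round needs only one pass over the data, which is why a small `rounds` value is attractive compared with the k sequential passes of k-means++ seeding.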
Similar resources
Fast and Provably Good Seedings for k-Means
Seeding – the task of finding initial cluster centers – is critical in obtaining high-quality clusterings for k-Means. However, k-means++ seeding, the state-of-the-art algorithm, does not scale well to massive datasets as it is inherently sequential and requires k full passes through the data. It was recently shown that Markov chain Monte Carlo sampling can be used to efficiently approximate the...
Password-Based Group Key Exchange in a Constant Number of Rounds
With the development of grids, distributed applications are spread across multiple computing resources and require efficient security mechanisms among the processes. Although authenticated group Diffie-Hellman key exchange protocols seem to be the natural mechanisms for supporting these applications, current solutions are either limited by the use of public key infrastructures or ...
Lower and Upper Bounds for Distributed Packing and Covering
We make a step towards understanding the distributed complexity of global optimization problems. We give bounds on the trade-off between locality and achievable approximation ratio of distributed algorithms for packing and covering problems. Extending a result of [9], we show that in k communication rounds, maximum matching and therefore packing problems cannot be approximated better than Ω(n 2...
Fooling Views: A New Lower Bound Technique for Distributed Computations under Congestion
We introduce a novel lower bound technique for distributed graph algorithms under bandwidth limitations. We define the notion of fooling views and exemplify its strength by proving two new lower bounds for triangle membership in the Congest(B) model: 1. Any 1-round algorithm requires B ≥ c∆ log n for a constant c > 0. 2. If B = 1, even in constant-degree graphs any algorithm must take Ω(log∗ n)...
Scalable K-Means++
Over half a century old and showing no signs of aging, k-means remains one of the most popular data processing algorithms. As is well known, a proper initialization of k-means is crucial for obtaining a good final solution. The recently proposed k-means++ initialization algorithm achieves this, obtaining an initial set of centers that is provably close to the optimum solution. A major downside ...